Executive Summary


Function Extraction (FX) is an emerging technology that can be applied to automated calculation
of the functional behavior of software for improved human understanding and analysis
[Pleszkoch 04, Hevner 05]. To better understand the impact of FX on software comprehension
and verification, a rigorous, controlled experiment was performed to compare traditional
manual methods of comprehension with automated behavior computation using an FX prototype.
The experiment required 26 experienced Java programmers (13 using traditional techniques
and 13 using FX automation) to evaluate the behavior of three small programs of
low-to-moderate complexity. The following observations summarize the experimental results:

• Use of the FX prototype greatly reduces the time required to derive program behavior
by automating this crucial and time-consuming part of program comprehension.

Subjects using traditional manual methods spent significantly more time (between 62%
and 81% of total task time) reading and analyzing code to determine program behavior,
and this percentage increased as the programs became longer and more difficult. In contrast,
subjects using FX automation determined program behavior directly from the prototype
output and thus spent very little time (between 0.2% and 0.3% of total task time) on
program comprehension. This represents an improvement of roughly two to three orders of
magnitude.

• Use of the FX prototype improves human performance in program comprehension.

The subjects who used the FX prototype produced significantly more correct answers to
comprehension and verification questions than the subjects using manual techniques; in
the case of the longest, most difficult program, correct answers increased by a factor of
3.6. The FX group also required significantly less time to achieve this improved comprehension.
This difference grew as the programs increased in length and difficulty, ultimately
resulting in a reduction of time required by a factor of 4.2. Treating productivity
as a ratio of task output to input (output = accurate program comprehension, input = total
time on task), the FX group achieved a productivity improvement on the order of a factor
of 15 for the longest and most difficult program.

• Developers who were trained on and used the FX prototype agree that it is useful,
supports the comprehension task, and is easy to use. The large improvement in program
comprehension and productivity for the group using the FX prototype was achieved
with just 45 minutes of instruction, contrasted with years of training and experience in
program reading and inspection for the group using traditional methods.

Standard  statistical  tests  of  the  significance  of  the  experimental  data  indicate  extremely  low 
probabilities  that  these  results  could  be  attributed  to  chance,  in  some  cases  computed  as  zero 
to  three  decimal  places. 
Abstract


Function Extraction (FX) is a new, theory-based technology for automated calculation of the
functional behavior of software. The CERT Function Extraction experiment was conducted
to better understand the impact of FX on human comprehension and verification of software
and to rigorously quantify the business case for FX technology. This report describes
the results of the controlled experiment that was performed to compare traditional manual
methods of comprehension with automated behavior computation using an FX prototype.

The  results  of  the  experiment  show  a  substantial  increase  in  human  capabilities  for  software 
comprehension  and  verification  using  FX  technology. 
1 Function Extraction Research Motivation


Because of the size and complexity of programs, current-generation software engineering
must operate in a world of incomplete knowledge of program behavior. No practical means
exist for programmers to determine the full functional behavior of sizable programs in all
circumstances of use, and no testing effort, no matter how extensive, can exercise more than
a small fraction of possible behavior. Lacking better technology, behavior discovery today is
a haphazard and time-consuming drain on resources carried out by manual techniques of program
reading and inspection with unavoidable human fallibility. Yet comprehensive knowledge
of software behavior is essential for fast and correct development, testing, maintenance,
and evolution of programs.

While this problem is pervasive today, it need not be so in the future. A key enabling capability
for next-generation software engineering is the transformation of program behavior analysis
from an error-prone, resource-intensive process in human-time scale into a precise, automated
calculation in CPU-time scale. The emerging technology of Function Extraction holds
promise for making this capability a reality.
2 Concepts of Function Extraction


Function Extraction (FX) deals with the semantics of software behavior. All levels of abstraction
in the development of software systems embody behavioral semantics, from low-level
machine language operations to high-level system capabilities. As software systems are developed
and evolve over time, semantic content is continuously created, intentionally or unintentionally,
correct or incorrect. Effective development and evolution of a system depends on
how well its behavioral semantics are understood. The complexity and quantity of accumulated
behavioral semantics can overwhelm developers, leading to loss of intellectual control.

The ultimate goal of Function Extraction is to calculate full semantic behavior at all levels of
system abstraction, from specification to design to implementation. This goal requires automating
the computation and composition of behaviors in the languages employed to express
such artifacts. These languages, whatever their level of abstraction, embody definitions of the
behavioral semantics of their structures. Function Extractor development begins with a
well-defined language whose semantics can be captured in terms of the functions of its structures
and the rules that govern their combination. Any system artifact written in that language can
then be submitted to the Function Extractor, which will apply the functional semantics of the
structures to produce as output a catalog containing all cases of behavior defined by the artifact.
This behavior is expressed in non-procedural form, essentially defining the as-built
specification of the artifact in terms of its mapping of inputs into outputs.

In  a  miniature  illustration,  consider  the  following  sequence  of  operations  on  small  integers  x 
and  y  (machine  precision  is  set  aside;  however,  the  semantics  of  finite  operations  could  be 
incorporated  if  necessary): 

do
    x := x + y
    y := x - y
    x := x - y
enddo

In this case, the behavioral semantics of each operation involves deriving the value of the
right-hand-side expression and assigning it to the variable on the left. The rule of combination
for a sequence of operations is ordinary function composition, easily expressed in the
following trace table and associated algebraic derivations that compute the net functional
behavior from input to output in non-procedural terms:


CMU/SEI-2005-TN-047


    Assignment      Value of x        Value of y
    x := x + y      x1 = x0 + y0      y1 = y0
    y := x - y      x2 = x1           y2 = x1 - y1
    x := x - y      x3 = x2 - y2      y3 = y2

    x3 = x2 - y2              y3 = y2
       = x1 - (x1 - y1)          = x1 - y1
       = y1                      = x0 + y0 - y0
       = y0                      = x0

Thus, the computed behavior is

    x, y := y, x

that is, the values of x and y are exchanged by the sequence of operations. It is important to
note that this computed behavior represents exactly what the program does; it is now unnecessary
to read and inspect the code in an attempt to derive this information. This example
illustrates the Function Extraction process for a sequence control structure. A function theorem
defines the mapping of all control structures into such functional forms, and is the
mathematical foundation of FX technology.
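
The trace-table derivation can be mimicked in a few lines of Python. The following is a
minimal sketch, not the FX prototype's algorithm: it assumes every statement has the form
target := left + right or target := left - right, and it represents each variable's value as a
linear combination of the initial values x0 and y0. The names initial_state, combine, run,
and SWAP are hypothetical and chosen for illustration only.

```python
def initial_state(names):
    # Each variable starts as 1 * its own initial value, e.g. x -> {"x0": 1}.
    return {name: {name + "0": 1} for name in names}


def combine(left, right, sign):
    # Add (sign = +1) or subtract (sign = -1) two linear combinations,
    # dropping any term whose coefficient cancels to zero.
    result = dict(left)
    for term, coeff in right.items():
        result[term] = result.get(term, 0) + sign * coeff
    return {term: coeff for term, coeff in result.items() if coeff != 0}


def run(program, names):
    # Apply the assignments in order; sequencing is function composition,
    # so each step reads the symbolic values produced by the previous step.
    state = initial_state(names)
    for target, left, op, right in program:
        sign = 1 if op == "+" else -1
        state[target] = combine(state[left], state[right], sign)
    return state


# The three-statement sequence from the text (assumed encoding).
SWAP = [
    ("x", "x", "+", "y"),
    ("y", "x", "-", "y"),
    ("x", "x", "-", "y"),
]

final = run(SWAP, ["x", "y"])
print(final)
```

For the three-statement sequence the sketch ends with x expressed as 1 * y0 and y expressed as
1 * x0, which is the exchange behavior x, y := y, x derived in the trace table.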

In a more general explanation, the function-theoretic model of software treats programs as
rules for mathematical functions or relations [Hevner 02, Hevner 05, Hoffman 01, Linger 79,
McCarthy 63, Mills 86, Mills 02, Pleszkoch 90, Pleszkoch 04, Prowell 99]. While sizable
programs can contain a virtually infinite number of execution paths, they are constructed of a
finite number of nested and sequenced control structures, each of which makes a finite contribution
to overall behavior. These structures correspond to mathematical functions or relations,
that is, mappings from inputs to outputs. These functional mappings can be automatically
extracted in a stepwise process that traverses the finite control structure hierarchy. At
each step, details of local code and data are abstracted out, while their net effects are preserved
and propagated in the extracted behavior. While no general theory for loop abstraction
can exist, use of recursive expressions and patterns for loops provides an engineering solution.
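
One way to picture the correspondence between control structures and functions is to model
each structure as a Python function from program state to program state. The sketch below is
an illustration under stated assumptions, not an extraction algorithm: seq, if_then_else, and
assign are hypothetical helper names, the conditional program is invented for the example, and
the candidate behavior is checked by brute force over a small range of integers rather than
derived.

```python
from itertools import product


def assign(var, expr):
    # Assignment: a function from state to state that updates one variable.
    return lambda state: {**state, var: expr(state)}


def seq(*steps):
    # Sequence: the net behavior is the composition of the steps, in order.
    def run(state):
        for step in steps:
            state = step(state)
        return state
    return run


def if_then_else(cond, then_part, else_part):
    # Alternation: the net behavior is a case split on the condition.
    return lambda state: then_part(state) if cond(state) else else_part(state)


def identity(state):
    return state


# Hypothetical program: exchange x and y only when x > y.
swap = seq(
    assign("x", lambda s: s["x"] + s["y"]),
    assign("y", lambda s: s["x"] - s["y"]),
    assign("x", lambda s: s["x"] - s["y"]),
)
program = if_then_else(lambda s: s["x"] > s["y"], swap, identity)


def behavior(state):
    # Candidate non-procedural behavior: x, y := min(x, y), max(x, y)
    return {"x": min(state["x"], state["y"]), "y": max(state["x"], state["y"])}


def agree(p, q, values):
    # Compare two state functions on every pair of inputs drawn from values.
    return all(
        p({"x": a, "y": b}) == q({"x": a, "y": b})
        for a, b in product(values, repeat=2)
    )


matches = agree(program, behavior, range(-5, 6))
print(matches)
```

The inner sequence contributes the exchange behavior and the alternation contributes the case
split, so the composed behavior has two cases that together amount to ordering the pair.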

Function Extraction has potential for widespread application across the software engineering
life cycle, as discussed in Hevner [Hevner 05]. Of special interest is use of FX for maintaining
and evolving legacy systems, as well as for developing new systems. The technology can
also play a key role in understanding malicious code; CERT is currently developing an
FX-based system for analyzing malicious code expressed in Intel Assembler Language. And because
behavior computation is a key part of both software verification at the program level
and component composition at the system level, FX application is possible in these areas as
well.
3 The Function Extraction Controlled Experiment


The CERT® organization of the Carnegie Mellon® Software Engineering Institute has implemented
Function Extraction technology in an FX prototype that operates on a small subset of
the Java programming language. The prototype takes in a Java program written in the language
subset and automatically calculates and displays its functional behavior. The behavior
is expressed in a non-procedural, user-readable format in terms of how the outputs of the
program are produced from its inputs in all possible uses, in effect producing the as-coded
specification of the program.

CERT has performed a formal, controlled experiment to quantify the impact of FX technology
on the ability of programmers to comprehend and verify programs. The experimental
subjects were 26 Carnegie Mellon University (CMU) graduate students with substantial
computer science education and software development experience. The experiment was approved
by the CMU Institutional Review Board (IRB), calibrated in two pilot tests, and conducted
according to rigorous experimental protocols.

The 26 subjects signed IRB consent forms and received 45 minutes of classroom instruction
on the purpose of the experiment and the process of program comprehension. They were then
randomly divided into two groups: the control group (manual manipulation) and the experimental
group (automated FX manipulation):

• The control group subjects were given three Java programs of varying size and difficulty,
together with functional requirements for the programs and a set of questions to answer.
They applied traditional manual methods of reading and inspection to understand the behavior
of each program and then answered the questions. All activities were self-timed by
the subjects, and a post-hoc questionnaire was completed.

• The experimental group subjects installed the Function Extraction prototype on their personal
laptop computers. These subjects, who had no previous exposure to the prototype,
received an additional 45 minutes of classroom instruction on its use and were then given
the same three Java programs, requirements, and questions. The experimental group ran
the programs through the FX prototype to derive and display their functional behavior,
and then answered the questions. Again, all activities were self-timed by the subjects, and
a post-hoc questionnaire was completed.


CERT  is  registered  in  the  U.S.  Patent  and  Trademark  Office  by  Carnegie  Mellon  University. 
Both the time required to perform the experimental tasks and the correctness of the answers
were measured. The correctness of the answers provided a measure of the ability of both
groups to understand the program behaviors and to verify the behaviors against requirements.

The validity of the experimental design and procedures was assessed in four ways. Participants

1. answered the manipulation check question appropriately, which indicated that they
understood their task setting (control or experimental group) accurately

2. ranked and rated the difficulty of each of the three programs, named Algorithm, Ordering,
and Bonus Points. The participants’ assessments were consistent with the experimental
design. The shortest, the Algorithm program, was ranked as the least difficult of
the three, and rated as less difficult than the participants’ usual program comprehension
tasks. The Ordering program is longer than the Algorithm program. It was ranked as
more difficult to comprehend and rated at about the same level of difficulty as the participants’
usual program comprehension tasks. The longest, the Bonus Points program, was
ranked as the most difficult and rated more difficult than usual comprehension tasks.

3. agreed that the training they received on program comprehension was sufficient to complete
the study tasks, and the participants who used the FX prototype strongly agreed
that the training they received on its use was sufficient

4. were experienced with program comprehension tasks and were randomly assigned. They
had experience in reading and verifying computer code (mean = 5.81 years, range = 2-15
years) and had taken multiple programming classes (mean = 8.5 classes, range = 3-20
classes). In addition, most participants had paid, non-classroom experience in reading
and verifying code (mean = 1.34 years, range = 0-5). Tests for equality of variance and
means reveal that the control and experimental groups did not differ significantly on
any of the three experience variables (see Table 1). These tests were conducted to rule out
the possibility that, by chance, more experienced individuals were assigned to either the
control or experimental group. (If significantly more experienced individuals had been
put into the experimental group, then that higher level of experience would provide a rival
explanation for the results.) Because none of these tests has significant results (p
values are all greater than .10), there is no significant difference between the control and
experimental groups in their experience in program comprehension; paid, non-classroom
experience; or number of programming classes taken.
Table 1: Comparison of Control and Experimental Groups on Experience

Statistical Test                Years of Experience  Number of            Years of Paid,
                                in Reading and       Programming          Non-Classroom,
                                Verifying Code       Classes              IT Experience

Levene’s Test for Equality      F = 2.716            F = .207             F = .268
of Variance                     p = .112             p = .654             p = .610

Test for Independent Samples    t = .937             t = .101             t = -0.333
(t-test for equality of means)  p = .358             p = .921             p = .742
4 FX Experimental Results


The results of the experiment strongly demonstrate the significant, positive impact of use of
the FX prototype on subjects’ performance, measured by time on task and accuracy of program
comprehension, as well as on subjects’ satisfaction with using the prototype. Subject
performance on the experimental tasks (descriptive data are summarized in Table 2) was
measured by the amount of time required to complete each task and the accuracy of answers
to the questions that tested program comprehension. Subjects reported the time they started
and stopped each task, as well as their estimates of how that time was allocated among the
following three components of each task:

1.  understanding  the  program  requirements  and  program  comprehension  questions 

2.  determining  the  functionality  of  the  programs 

3.  recording  answers  to  the  program  comprehension  questions  on  the  form 


Table 2: Descriptive Data on Program Comprehension Performance

                     Algorithm Program                 Ordering Program                  Bonus Points Program
                     (least difficult)                 (average difficulty)              (most difficult)

                     Time on Task     Percentage       Time on Task     Percentage       Time on Task     Percentage
                     (minutes)        correct out of   (minutes)        correct out of   (minutes)        correct out of
                                      5 questions                       10 questions                      10 questions

Control Group:       Mean = 9.15      Mean = 82%       Mean = 24.4      Mean = 73%       Mean = 57        Mean = 23%
Traditional Method   Range = 5-12     Range = 60-100%  Range = 15-35    Range = 50-90%   Range = 29-89    Range = 0-80%
                                                       (3 did not
                                                       complete)

Experimental Group:  Mean = 5.62      [missing]        Mean = 14.2      Mean = 89%       Mean = 13.5      Mean = 83%
FX Support           Range = 3-10                      Range = 7-20     Range = 40-100%  Range = 5-18     Range = 65-95%

The descriptive data on how subjects divided their time are reported in Table 3. The two
"Determining program functionality" rows in this table show the contrast between the two
methods of program comprehension: with the FX prototype little or no time was spent
determining program functionality, while with the traditional method the majority of time
was spent reading and interpreting code to determine its functionality.
The subjects in the experimental group also evaluated the FX prototype on several standard
criteria and their measures [Venkatesh 00, Wang 05]. The evaluation criteria, the scale reliability
(which identified two items for removal in order to achieve adequate reliability of
Cronbach’s alpha > .70), and results are shown in Table 4.


Table 3: Mean Percentages of Estimates of How Time Was Spent

                                                       Algorithm Program  Ordering Program   Bonus Points Program

Control Group: Traditional Method
1. Understanding program requirements and questions    12.4% = 1.1 min    17.8% = 4.3 min    5.0% = 2.9 min
2. Determining program functionality                   62.3% = 5.7 min    61.6% = 15.0 min   81.1% = 46 min
3. Recording answers on form                           25.3% = 2.3 min    20.6% = 5.0 min    13.9% = 7.9 min

Experimental Group: FX Automation
1. Understanding program requirements and questions    28.5% = 1.6 min    46.5% = 6.6 min    35.2% = 4.7 min
2. Determining program functionality                   0.3% = 0.04 min [only one cell of this row survives in the source]
3. Recording answers on form                           71.2% = 4.0 min    53.3% = 7.7 min    65.3% = 8.8 min

Table 4: Participant Evaluation of the FX Prototype

Evaluation Criterion  Definition                              Reliability          Ratings (n=13)
                                                              (Cronbach’s alpha)   1 = strongly disagree
                                                                                   5 = strongly agree

Perceived Usefulness  extent to which participants believe    .755                 Mean = 4.1
                      that using the FX prototype will                             Range = 3-5
                      enhance job performance

Output Quality        participants’ assessment of how well    .716                 Mean = 3.7
                      the FX prototype performs tasks                              Range = 2-5
                      relevant to the participants’ job

Technical Utility     participants’ assessment of the value,  .780                 Mean = 4.25
                      innovativeness, and usefulness of                            Range = 3-5
                      FX prototype

Perceived Ease of     extent to which participants believe    .824                 Mean = 4.5
Use                   that using the FX prototype is free     (2 items removed)    Range = 3-5
                      of effort

Intention to Use      participants’ intention to use the FX   .941                 Mean = 4.2
                      prototype in the future                                      Range = 3-5

Task Support          participants’ assessment of the amount  N/A (1 item, no      Mean = 2.5
                      of work on task that was not            reliability          Range = 1-5
                      supported by the FX prototype           calculated)

Statistical analysis of the study data enables tests of significance of the program comprehension
findings. Performance on the three experimental tasks was measured by three dependent
variables: accuracy of program comprehension, total time on task, and time required to derive
program behavior. Study data were analyzed using analysis of variance (ANOVA), at an alpha
level of .05. (An alpha level for a statistical test represents the probability that a significant
result from that test is found when in fact there is no significant difference. In experimental
research an alpha level of .05 is typically used.) ANOVA is commonly used to test for
significant differences between the control and experimental groups. In this study nine tests
were run to assess whether or not there were significant differences between groups in the
three types of program performance for each of the three study tasks. The results of these
tests are shown in Table 5.


Table 5: Results of the ANOVA for Program Comprehension Performance

Performance Measure       Algorithm Program       Ordering Program        Bonus Points Program
                          (least difficulty)      (average difficulty)    (most difficulty)

Accuracy of Program       F = 6.854               F = 6.251               F* = 75.489
Comprehension             p = .015                p = .021                p = .000

Total Time on Task        F = 15.910              F* = 14.988             F* = 69.527
                          p = .001                p = .002                p = .000

Time Required to Derive   F* = 97.768             F* = 62.446             F* = 174.719
Program Behavior          p = .000                p = .000                p = .000

* = asymptotically distributed F statistic from Welch/Brown-Forsythe Robust Test of Equality of Means


A critical assumption of ANOVA is that the control and experimental groups have equal variances.
For those tests where the assumption of equal variances based on the Levene statistic
is not met (indicated with an asterisk), the Welch/Brown-Forsythe Robust Test of Equality of
Means was used, and the asymptotically distributed F statistic reported. In all cases the F statistic
represents a ratio of how much the observations vary between groups to how much the
observations vary within each of the groups. When the F statistic is near 1 it indicates that
there is no statistically significant difference between observations of the control and experimental
groups. Larger values of F indicate that the groups’ means differ. The p-value is a
measure of the statistical strength of the differences that are observed. The smaller the
p-value, the stronger the evidence is of statistical significance. It represents the probability of
obtaining a difference at least as large as the one observed if there were in fact no difference
between the groups. (In practical terms, very low p-values indicate that results like these
would very rarely arise by chance alone.)
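
The F statistic can be illustrated with a small computation. The following sketch computes
the classical one-way ANOVA F statistic as the between-group mean square divided by the
within-group mean square, using only the Python standard library. It assumes equal variances
and does not implement the Welch/Brown-Forsythe correction; the function name one_way_f and
all data values are hypothetical and are not the experiment's raw data.

```python
from statistics import mean


def one_way_f(groups):
    # Classical one-way ANOVA F statistic:
    # (between-group mean square) / (within-group mean square).
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations

    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Hypothetical task times in minutes (illustrative values only).
control = [50, 60, 55, 65, 45]
experimental = [12, 15, 13, 14, 11]
f_far_apart = one_way_f([control, experimental])

# Two hypothetical groups whose values overlap almost completely.
f_overlapping = one_way_f([[10, 12, 14, 16], [11, 13, 15, 17]])

print(f_far_apart, f_overlapping)
```

When the group means are far apart relative to the spread inside each group, the ratio is
large; when the groups overlap, the ratio falls below 1.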

The  results  reported  in  Table  5  provide  very  strong  evidence  that  use  of  the  FX  prototype  has 
a  positive  impact  on  program  comprehension  performance:  significantly  better  accuracy  of 
comprehension  in  significantly  less  time.  The  results  also  show  that  the  FX  prototype  does  in 
fact  automate  the  derivation  of  program  behavior,  since  the  group  using  the  FX  prototype 
required  significantly  less  time  for  this  part  of  the  task,  and,  as  reported  in  Table  3,  this  time 
averaged  less  than  1%  of  experimental  subjects’  total  time  on  task. 
5 Analysis of the Experimental Results


The  following  observations  are  based  on  analysis  of  the  experimental  data: 

1. Use of the FX prototype significantly reduces the time required to derive program
behavior by truly automating this crucial and time-consuming part of program
comprehension. Subjects who used the FX prototype experienced a significant reduction
in the time required for the task of deriving an understanding of program behavior
compared with the control group. Whether in the control or experimental groups, subjects
had to spend at least some time on understanding the program requirements and the
program comprehension questions, and on recording their answers. But the control
group subjects using traditional manual methods spent most of their time (between 62%
and 81% of total task time) reading and analyzing code to determine program behavior,
and this percentage increased as the programs became longer and more difficult. In contrast,
the subjects using the FX prototype were able to determine program behavior directly
from the prototype output, and thus spent very little time (between 0.2% and 0.3% of
total task time) on this part of the program comprehension task. Moreover, the percentage
of time subjects took to determine program functionality using the FX prototype did
not increase with program length and difficulty, suggesting that improvements will be
even more dramatic with further increases in length and difficulty of the programs
analyzed.

2.  Use  of  the  FX  prototype  significantly  improves  human  performance  in  program 
comprehension  and  verification.  Subjects  who  used  the  FX  prototype  produced  far 
more  correct  answers  to  the  comprehension  questions  than  the  control  group;  in  the  case 
of  the  longest,  most  difficult  program  by  a  factor  of  3.6.  They  also  required  far  less  time 
to  achieve  this  improved  comprehension  than  the  subjects  who  used  traditional  methods 
of  reading  and  inspecting  code.  While  the  differences  in  performance  were  significant 
for  all  programs,  the  difference  in  time  performance  was  relatively  small  for  the  initial, 
warm-up  task  (the  Algorithm  program)  but  increased  as  the  programs  increased  in  length 
and  difficulty,  ultimately  resulting  in  an  improvement  factor  of  4.2  for  the  Bonus  Points 
program.  That  is,  subjects  in  the  experimental  group  completed  this  most  difficult  task  in 
about  a  fourth  of  the  time  required  by  the  control  group.  Treating  productivity  as  a  ratio 
between  task  output  and  input  (in  this  case,  output  =  accurate  program  comprehension 
and  input  =  total  time  on  task),  the  FX  group  achieved  an  improvement  in  productivity 
on  the  order  of  a  factor  of  15  for  the  longest  and  most  difficult  program. 

3. Developers who were trained on and used the FX prototype agree that it is useful,
supports the comprehension task, and is easy to use. The large improvement in program
comprehension for subjects using the FX prototype was achieved with just 45 minutes
of instruction, contrasted with years of training and experience in manual program
reading and inspection in the control group. As stated by one subject in response to an
open-ended question about how well the FX prototype supports program comprehension:
“The FX tool was a good tool for verifying computer code. It allows you to get an
understanding of what the function actually does without being forced to go through the
code line by line. I didn't really read the code because we used the tool.”
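
The improvement factors quoted in observation 2 can be reproduced from the group means in
Table 2. The sketch below is only a plausibility check: it assumes the factors are ratios of
group means for the Bonus Points program, which the report does not state explicitly, and the
variable and function names are illustrative.

```python
# Group means for the Bonus Points program, taken from Table 2.
control = {"minutes": 57.0, "pct_correct": 23.0}
fx = {"minutes": 13.5, "pct_correct": 83.0}

# Accuracy improves by the ratio of percentage correct (FX over control).
accuracy_factor = fx["pct_correct"] / control["pct_correct"]

# Time improves by the ratio of time on task (control over FX).
time_factor = control["minutes"] / fx["minutes"]


def productivity(group):
    # Productivity as output over input: percentage correct per minute on task.
    return group["pct_correct"] / group["minutes"]


productivity_factor = productivity(fx) / productivity(control)

print(round(accuracy_factor, 1), round(time_factor, 1), round(productivity_factor, 1))
```

Under this assumption the accuracy and time factors round to 3.6 and 4.2, and their product
is a little over 15, consistent with the "order of a factor of 15" stated for productivity.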

In summary, this experiment demonstrates significant advantages for FX technology over
traditional methods of code reading and inspection. This objective quantification of the business
case for FX will guide future development of the technology. Additional experimental
studies are planned to further evaluate human performance in program comprehension with
and without FX support. These studies will provide additional feedback for development of
FX technology, human interfaces, and operational processes.